In this project we will apply a Fully Bayesian Logistic Regression
model aimed at predicting the onset of diabetes mellitus in Pima Indians
given diagnostic measurements.
Diabetes is a metabolic disorders characterized by a high blood sugar
level (hyperglycemia) over a prolonged period of time. It is mainly due
to the reduction of insulin production or lack of capability by the
cells to properly respond to the insulin produced.
The dataset used is Pima Indians Diabetes Dataset, publicly available on Kaggle : https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, containing medical information of female patients belonging to Pima Indian heritage of age \(\geq\) 21.
It consist of several medical predictor variables:
Pregnancies: Number of Times Pregnant (Int)
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
BloodPressure: Diastolic blood pressure (mm Hg):
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2):
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Patient Age (Years)
For each patient, the binary target variable Outcome indicates whether or not she is affected by Diabetes.
First, let’s look for missing/inconsistent values in our data.
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
|---|---|---|---|---|---|---|---|---|
| Min. | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0780 | 21.0000 |
| 1st Qu. | 1.0000 | 99.0000 | 62.0000 | 0.0000 | 0.0000 | 27.3000 | 0.2438 | 24.0000 |
| Median | 3.0000 | 117.0000 | 72.0000 | 23.0000 | 30.5000 | 32.0000 | 0.3725 | 29.0000 |
| Mean | 3.8451 | 120.8945 | 69.1055 | 20.5365 | 79.7995 | 31.9926 | 0.4719 | 33.2409 |
| 3rd Qu. | 6.0000 | 140.2500 | 80.0000 | 32.0000 | 127.2500 | 36.6000 | 0.6262 | 41.0000 |
| Max. | 17.0000 | 199.0000 | 122.0000 | 99.0000 | 846.0000 | 67.1000 | 2.4200 | 81.0000 |
| Missing Values | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Zeros | 111.0000 | 5.0000 | 35.0000 | 227.0000 | 374.0000 | 11.0000 | 0.0000 | 0.0000 |
 Â
There are no missing data, but several meaningless zero values, that
we can consider as NaN (e.g., BloodPressure = 0). So, we
replaced the zeros with the per-class median in the following
fields:
For what concerns the Skin Thickness and Insulin, the amount of missing values is respectively 0.296 and 0.296. Hence, such replacement would be meaningless. Therefore, we removed those features.
Here it is represented the Box Plot for all the features
In the following, they are reported some plots representing the distribution of the different features. Please notice that the values reported refer to intra-class percentages, i.e., the counts are normalized by dividing for the carnality of the class they refer to. By doing so, we are able to compare the different categories regardless from the class unbalance of the data set.
In this dataset, diabetes seems more common among women that had experienced many pregnancies. Although gestation per se cannot cause diabetes (except for gestational diabetes, which is a specific temporary condition), an high number of pregnancies can lead to an hormonal imbalance or weight gain, which may increase the risk of developing diabetes.
In this dataset, diabetes seems more common across people above age of 35; this is possibly because that Type I diabetes can arise at any age, while Type II diabetes is more common across people of age \(\geq 40\).
Not surprisingly, diabetic people have an higher glycemic index, being this is one of the hillness’ symptoms.
Diabetic people seem to suffer more from hypertension than non-diabetic patients. This may be due to the fact that some forms of diabetes (Type II) are associated with high weight, which on turn may cause high blood pressure.
Diabetes is more common among obese people; indeed obesity is one of the risk factor for developing Type II diabetes.
The features in our dataset are weakly correlated, hence in the regression, there won’t be Multicollinearity problems.
Age and number of Pregancies are weakly Diabeticly correlated (\(\rho \approx 0.5\) ). This is not surprising since elder women are more likely to have had an higher number of Pregnancies during their life.
In order to overcome the severe class imalance of our data, we applied SMOTE: a data augmentation strategy consisting in oversampling the minority class (Diabetic in our case).
Eventually we obtained a dataset with the same statistics as above, but more balanced!
We will use Bayesian Logistic Regression to model the relationship between the binary target variable \(Y \in \{0,1\}\) (diabetic = 1, healthy = 0) with the input features \(x_j\ \ ,\ \ \ j=1,...,p\):
| feature name | variable name |
|---|---|
| Pregnancies | x1 |
| Glucose | x2 |
| BloodPressure | x3 |
| BMI | x4 |
| DiabetesPedigreeFunction | x5 |
| Age | x6 |
| Outcome | Y |
Moreover, in the following we will consider the canonical notation: \(x_0=1\) in order to model the intercept. Differently from Bayesian linear regression, there is no variance term to be estimated, and only the regression parameters (\(\beta_j\)) will be estimated.
For each patient, the target variable can be modeled as a Bernoulli: \[ Y|\boldsymbol{x} \sim Ber(\theta_{\boldsymbol{x}})\] Hence: \[ \Pr(Y = 1|\boldsymbol{x}) = \mathbb{E}[Y = 1|\boldsymbol{x}] = \theta_\boldsymbol{x} \] In logistic regression, the Link function that links \(\mathbb{E}[Y = 1|\boldsymbol{x}]\) with the linear predictor \(\boldsymbol{\beta \cdot x} = \sum_{j} \beta_j \cdot x_j\) is the Logit Function:
\[ \theta_{\boldsymbol{x}} = \text{ilogit}\left(\boldsymbol{\beta \cdot x} \right) \iff \boldsymbol{\beta \cdot x} = \text{logit} \left(\theta_{x}\right) \]
where: \[ \text{ilogit}\left(z\right) = \frac{\exp(z)}{1+\exp(z)} \] is the sigmoid function, forcing \(\boldsymbol{\beta \cdot x_i}\) in the \([0,1]\) range;
and \[\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)\] Is the logit function (inverse of the sigmoid) computing the logarithm of the odds.
The overall dataset is randomly shuffled and sliced in Train and Test sets according to a 80/20 split.
Let’s say we have \(n\) observations; we can define their likelihood function as follows: \[ L_{\boldsymbol{y}}(\boldsymbol{\beta}, \boldsymbol{X}) = f(\boldsymbol{y}|\boldsymbol{\beta},\boldsymbol{X}) = \prod_{i= 1}^n f(y^{(i)}|\boldsymbol{\beta,x^{(i)}}) = \prod_{i= 1}^n \text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}}) ^{y^{(i)}}\left(1-\text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}})\right)^{1-y^{(i)}} \]
The joint prior distribution of the parameters on \(\mathbf{B}^p = \{(-\infty, +\infty)\}^{p}\) can be computed as the product of their marginal distributions:
\[ \pi(\boldsymbol{\beta}) = \prod_{j= 1}^{p} \pi(\beta_j) = \pi(\beta_0)\cdot \pi(\beta_1)\cdot ...\cdot \pi(\beta_p) \]
In our case we will consider two kind of priors, expressing our belief that each \(\beta_j\) represents the true population characteristics (under our modellistic assumptions), prior to the observation of any data :
\[ \boldsymbol{\beta} \sim N_p(\boldsymbol{\mu}, \mathbb{I}_p\sigma^2) \iff\beta_j \overset{iid}{\sim} N\left(\mu_j, \sigma_j^2= 10^4\right) \ \ \ j = 1,..., p \]
\[ \beta_j \overset{iid}{\sim} \text{Laplace}\left(\mu_j,b_j = \frac{10^2}{\sqrt2}\right) \ \ \ j = 1,..., p \] For what conernes the hyperparameters, we consider both Normal and Laplace distribution centered in 0 (\(\mu_j = 0 \ , \ \ j = 1,..., p\)) Since we have weak prior knowledge, a diffuse prior distribution is used, spreading more or less evenly the probability distribution over \(\beta_j\), which is modeled in terms of assigning an high variance to the prior:
We will then compare the Deviance Information Criterion of the two models to select the most fitting one.
In Bayesian inference we combine our prior beliefs about the model parameters - expressed in terms of prior distribution - with the likelihood of data. The Posterior distribution express our belief that \(\beta_j\) is the true value, having observed the data. The engine that allow us to obtain the posterior distribution from the prior distribution is the Bayes Rule: \[ \text{Posterior} \propto \text{Prior}\times \text{Likelihood} \]
\[ \pi(\boldsymbol{\beta}|\boldsymbol{y,X}) \propto L_{\boldsymbol{Y}}(\boldsymbol{\beta,X})\times \pi(\boldsymbol{\beta}) = \\ \text{[Gaussian marginal priors]}\\ =\prod_{i= 1}^n \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{y^{(i)}} \left(1- \frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \prod_{j=0}^{p} {\frac {1}{\sigma_{j} {\sqrt {2\pi }}}}\exp \left(-{\frac {1}{2}}{\frac {(\beta_j-\mu_j )^{2}}{\sigma_{j} ^{2}}}\right) \\ \text{[Laplace marginal priors]}\\ =\prod_{i= 1}^n \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{y^{(i)}} \left(1- \frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \prod_{j=0}^{p} {\frac {1}{b_{j}}}\exp \left(-{\frac {1}{2}}{\frac {|\beta_j-\mu_j |}{\sigma_{j} ^{2}}}\right) \]
Now, let’s derive the full conditionals In the Gaussian Prior case, this is equal to: \[ \pi(\beta_j|\boldsymbol{y,X,\beta_{(-j)}}) \propto \pi(\boldsymbol{y}|\boldsymbol{X,\beta}) \pi(\beta_j) = \prod_{i= 1}^n \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{y^{(i)}} \left(1- \frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \cdot {\frac {1}{\sigma_{j} {\sqrt {2\pi }}}}\exp \left(-{\frac {1}{2}}{\frac {(\beta_j-\mu_j )^{2}}{\sigma_{j} ^{2}}}\right) \]
In the Laplace Prior case, this is equal to: \[ \pi(\beta_j|\boldsymbol{y,X,\beta_{(-j)}}) \propto \pi(\boldsymbol{y}|\boldsymbol{X, \beta}) \pi(\beta_j)= \prod_{i=1}^{n} \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}})}\right)^{y^{(i)}}\left(1-\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp(\boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \cdot{\frac {1}{2b_{j}}}\exp \left(-{\frac {|\beta_j-\mu_{j} |}{b_{j}}}\right) \]
Both distributions do not belong to any known parametric families; hence, it is not possible to directly draw i.i.d. simulations from them in order to provide estimates for \(\boldsymbol{\beta}\); we will hence rely on Markov Chain simulations to obtain estimates for the unknown parameters.
Gibbs sampling is an algorithm aimed to simulate a suitable Markov Chain:
\[ \beta_{j}^{(t+1)} = \pi\left(\beta_{j}|\boldsymbol{\beta_{(-j)}}\right) = \pi\left(\beta_{j} |\beta_0^{(t+1)}, \beta_{1}^{(t+1)}, ..., \beta_{j-1}^{(t+1)}, \beta_{j+1}^{(t)},...,\beta_{p}^{(t)} \right) \]
This per se does not provide a strictly stationary Markov Chain, but:
We can look at each update - which is temporary within the Gibbs cycle - as a transition from one state to another.
\[ \beta_j^{(t+1)} = \pi \left(\beta_j |\boldsymbol{\beta_{(-j)}}\right) \rightarrow K_j \]
Metropolis-Hastings consists in building a Markov Chain by iteratively draw a candidate \(Y_{t+1}=y\) from a proposal distribution \(p_x(y)\) - which depends on the current state of the chain \(X_{t}=x\) - and decide weather accept it as next state of the chain or remain in the current state \(x\). Hence, the next state of the chain: \[ X_{t+1} = \begin{cases} y \ \ \text{ with probability }\ \ \ \ \ \ \ \ \ \alpha(x,y)\\ x \ \ \text{ with probability } \ 1- \alpha(x,y) \end{cases} \] where: \[ \alpha(x,y) = \min\left\{1,\frac{\pi(y)}{\pi(x)}\frac{p_y(x)}{p_x(y)}\right\} \]
Gibbs sampling is a special case of Metropolis algorithm, in which the new proposal is accepted with probability 1. It requires to recognize the full conditionals, which is not our case since any of the \(\beta_j\) belong to a well-known parametric family. In this case we can apply a Metropolis-within-Gibbs algorithm, which consists in replacing the Kernel \(K^{GS}\) from which we are not able to directly simulate in the Gibbs Sampling with the kernel of the Metropolis \(\tilde{K}^{MH}\), having as target density the full conditional of \(\beta_j\).
model
{
for (i in 1:N){
Y[i] ~ dbern(p[i])
p[i] <- ilogit(beta0+beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i])
}
# Defining the prior beta parameters
beta0 ~ ddexp(0, 1.0E-4)
beta1 ~ ddexp(0, 1.0E-4)
beta2 ~ ddexp(0, 1.0E-4)
beta3 ~ ddexp(0, 1.0E-4)
beta4 ~ ddexp(0, 1.0E-4)
beta5 ~ ddexp(0, 1.0E-4)
beta6 ~ ddexp(0, 1.0E-4)
}
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.6472 | 0.7593 | -10.1110 | -9.1532 | -8.6436 | -8.1331 | -7.2100 | 1.0005 | 3000 |
| beta1 | 0.1168 | 0.0316 | 0.0581 | 0.0951 | 0.1164 | 0.1389 | 0.1787 | 1.0005 | 3000 |
| beta2 | 0.0354 | 0.0034 | 0.0289 | 0.0330 | 0.0354 | 0.0376 | 0.0423 | 1.0008 | 3000 |
| beta3 | -0.0062 | 0.0081 | -0.0221 | -0.0117 | -0.0063 | -0.0007 | 0.0101 | 1.0008 | 3000 |
| beta4 | 0.1027 | 0.0150 | 0.0738 | 0.0924 | 0.1024 | 0.1128 | 0.1325 | 1.0007 | 3000 |
| beta5 | 0.8973 | 0.2855 | 0.3435 | 0.7077 | 0.8944 | 1.0909 | 1.4522 | 1.0019 | 1400 |
| beta6 | 0.0130 | 0.0094 | -0.0048 | 0.0067 | 0.0128 | 0.0193 | 0.0314 | 1.0007 | 3000 |
| deviance | 849.4309 | 3.9556 | 844.1442 | 846.6033 | 848.7129 | 851.5262 | 858.6929 | 1.0007 | 3000 |
| DIC | pD |
|---|---|
| 857.2577 | 7.8269 |
# MODEL SPECIFICATION
model
{
for (i in 1:N){
Y[i] ~ dbern(p[i])
p[i] <- ilogit(beta0+beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i])
}
# Defining the prior beta parameters
beta0 ~ ddexp(0, 1.0E-4)
beta1 ~ ddexp(0, 1.0E-4)
beta2 ~ ddexp(0, 1.0E-4)
beta3 ~ ddexp(0, 1.0E-4)
beta4 ~ ddexp(0, 1.0E-4)
beta5 ~ ddexp(0, 1.0E-4)
beta6 ~ ddexp(0, 1.0E-4)
}
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.6299 | 0.7467 | -10.0184 | -9.1809 | -8.6562 | -8.0994 | -7.2196 | 1.0119 | 960 |
| beta1 | 0.1165 | 0.0317 | 0.0552 | 0.0957 | 0.1162 | 0.1370 | 0.1792 | 1.0012 | 3000 |
| beta2 | 0.0350 | 0.0033 | 0.0288 | 0.0327 | 0.0350 | 0.0371 | 0.0416 | 1.0049 | 470 |
| beta3 | -0.0061 | 0.0081 | -0.0226 | -0.0113 | -0.0061 | -0.0004 | 0.0092 | 1.0058 | 480 |
| beta4 | 0.1032 | 0.0152 | 0.0732 | 0.0931 | 0.1034 | 0.1133 | 0.1325 | 1.0111 | 590 |
| beta5 | 0.8920 | 0.2866 | 0.3391 | 0.7014 | 0.8790 | 1.0800 | 1.4671 | 1.0013 | 2600 |
| beta6 | 0.0133 | 0.0093 | -0.0044 | 0.0070 | 0.0132 | 0.0192 | 0.0320 | 1.0031 | 3000 |
| deviance | 849.3543 | 3.7506 | 844.0283 | 846.6301 | 848.7282 | 851.3682 | 858.7077 | 1.0055 | 400 |
| DIC | pD |
|---|---|
| 856.3575 | 7.0032 |
The 95%-CI for \(\beta_3\) in both models contains the zero, hence the parameter does not matter in our analysis in the relation between the covariates and the outputs. Let’s have a look to the model if we remove the variable \(x_3\) = Blood Pressure:
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.8640 | 0.6967 | -10.2138 | -9.3367 | -8.8709 | -8.3993 | -7.4989 | 1.0008 | 3000 |
| beta1 | 0.1158 | 0.0314 | 0.0548 | 0.0951 | 0.1150 | 0.1367 | 0.1780 | 1.0008 | 3000 |
| beta2 | 0.0348 | 0.0034 | 0.0283 | 0.0325 | 0.0348 | 0.0371 | 0.0417 | 1.0006 | 3000 |
| beta4 | 0.0995 | 0.0142 | 0.0718 | 0.0898 | 0.0999 | 0.1092 | 0.1275 | 1.0019 | 1400 |
| beta5 | 0.9061 | 0.2868 | 0.3602 | 0.7104 | 0.9000 | 1.0992 | 1.4678 | 1.0023 | 1100 |
| beta6 | 0.0113 | 0.0090 | -0.0060 | 0.0051 | 0.0112 | 0.0175 | 0.0289 | 1.0011 | 3000 |
| deviance | 848.8892 | 3.4988 | 844.1271 | 846.4120 | 848.1873 | 850.6570 | 856.9820 | 1.0036 | 830 |
| DIC | pD |
|---|---|
| 854.9993 | 6.11 |
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.8867 | 0.7055 | -10.3427 | -9.3491 | -8.8694 | -8.4005 | -7.5533 | 1.0098 | 220 |
| beta1 | 0.1151 | 0.0321 | 0.0497 | 0.0936 | 0.1151 | 0.1369 | 0.1783 | 1.0023 | 1100 |
| beta2 | 0.0348 | 0.0034 | 0.0280 | 0.0326 | 0.0349 | 0.0371 | 0.0414 | 1.0042 | 590 |
| beta4 | 0.0998 | 0.0153 | 0.0700 | 0.0894 | 0.0998 | 0.1104 | 0.1299 | 1.0107 | 220 |
| beta5 | 0.9119 | 0.2882 | 0.3601 | 0.7118 | 0.9126 | 1.0993 | 1.4859 | 1.0017 | 3000 |
| beta6 | 0.0116 | 0.0093 | -0.0061 | 0.0054 | 0.0114 | 0.0179 | 0.0297 | 1.0058 | 380 |
| deviance | 849.1369 | 3.6672 | 844.1308 | 846.4201 | 848.5044 | 851.0922 | 858.1590 | 1.0026 | 2000 |
| DIC | pD |
|---|---|
| 855.8589 | 6.722 |
The criterion used for selecting the best model is the Deviance Information Criterion (DIC): a comparative index used for choosing among competing models. It is a measure of goodness of fit of the model (average deviance), discounted by a penalization term,representing the number of parameters (pD):
\[ DIC = p_D+ \hat{D}_{\text{avg}}(\theta) \] where:
\[\hat{D}_{\text{avg}}(\theta) \approx \frac{1}{M} \sum_{i=1}^M -2 \log\left(f\left(y|\theta^{(j)}\right) \right) \]
The model with the smallest DIC assumes Normal Priors and excludes
the variable x3; hence, the following analysis will be
referred to such setup.
Traceplot: The traceplots allow to assess the convergence of the markov chain by showing the time series of the chain. All chains seem to be exploring the same region of parameter values, which is a good sign. The expected outcome is to observe a stationary behavior after a number of transitions, which translates into regular oscillation within a certain range due to the chain reaching the target distribution.
Density plot: Depicts the simulated sampling distribution of the parameters, for each chain. The yellow vertical line represents the empirical mean, while the horizontal one the 95% quantile based credible interval based on the joint results of all of the three chains.
Autocorrelation Function Plot : The Auto Correlation Function(ACF) plots visualize how much the correlation between the simulated values holds in the previous states.
If the process becomes stationary, we expect the correlation to vanish as we increase the lag \(h\). In the limit, if \(\rho=0\) the simulations are iid (which of course is not our case).
One of the main goals of performing MCMC is to approximate quantities which are functionals of the target distribution \(\pi(\cdot)\), such as the empirical mean: \[ I = \mathbb{E}_{\pi}[\beta_{j}] \approx \hat{I_t}= \sum_{i=1}^t\frac{\beta_{i}}{t} \]
Where we assume the sample \(\beta_{j}^{(1)}, ..., \beta_{j}^{(t)}\) to be simulated from suitable Markov Chain: invariant w.r.t. \(\pi(\cdot)\). Such condition is satisfied either (i) by considering as starting distribution the stationary distribution or (ii) - under the regularity conditions make the ergodic behavior (SLL) possible - by taking only the samples following the burn-in time \(T_0\), a suitable time such that \(\beta_{j}^{(T_0-1)} \approx \pi(\cdot)\).
By looking the plots we can see that the line that quickly approaches the overall mean. The point in which this happens is precisely the burn-in time.
As criterion to establish the parameter with the larges the posterior uncertainty, we can look at the width of the quantile-based credible intervals (95%):
| LB | UB | width | |
|---|---|---|---|
| beta0 | -10.3427 | -7.5533 | 2.7894 |
| beta1 | 0.0497 | 0.1783 | 0.1285 |
| beta2 | 0.0280 | 0.0414 | 0.0135 |
| beta4 | 0.0700 | 0.1299 | 0.0599 |
| beta5 | 0.3601 | 1.4859 | 1.1258 |
| beta6 | -0.0061 | 0.0297 | 0.0359 |
| LB | UB | width | |
|---|---|---|---|
| beta0 | -10.3184 | -7.5380 | 2.7804 |
| beta1 | 0.0483 | 0.1760 | 0.1276 |
| beta2 | 0.0280 | 0.0414 | 0.0134 |
| beta4 | 0.0714 | 0.1309 | 0.0595 |
| beta5 | 0.3445 | 1.4715 | 1.1270 |
| beta6 | -0.0063 | 0.0295 | 0.0358 |
As we can notice, in both cases the parameter associated to the widest credible interval is the intercept: \(\beta_0\).
The most correlated couple of parameters is \(\beta_4\) and \(\beta_0\).
The empirical mean is an unbiased estimator; therefore the MSE is equal to its variance. Differently from the Vanilla Monte Carlo case, in which the estimate are computed on iid samples, in the MCMC each realization of the sample depends on the previous one, which on turns depend on its prior and so on.. As a consequence, the variance of the empirical mean of the simulated values cannot be computed as the simulated values divided by the number of simulations, but we must take into account the covariance between each pair of simulated values. The higher the correlation, the larger will be the variance of the approximation provided by the MCMC algorithm with respect to the variance of the iid simulations proper of the MC method. This implies that the Markov Chain is more inefficient than standard iid simulation. This can be quantified by the effective sample size (ESS):
\[ t_{eff}=\frac{t}{\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \]
Used to normalize the variance of \(h(\beta_j)\) and get the variance of \(\hat{I}_t\):
\[ \sigma^2_{\hat{I}_t} = \mathbb{V}\text{ar}[\hat{I}_t] =\frac{\mathbb{V}\text{ar}_{\pi}(h(\beta_j))}{t_{eff}} = \frac{1}{\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \cdot \frac{\sigma^2}{t} \]
Another estimate of the inaccuracy of the MC samples is the Monte Carlo Standard Error (MCSE). It measures the standard deviation around the posterior mean of the samples due to the uncertainty of the MCMC algorithm.
| ESS | IFiid | MCMCerror | MCSEerror | |
|---|---|---|---|---|
| beta0 | 1000 | 0.0005 | 0.0005 | 0.0216 |
| beta1 | 1000 | 0.0000 | 0.0000 | 0.0010 |
| beta2 | 1000 | 0.0000 | 0.0000 | 0.0001 |
| beta4 | 1000 | 0.0000 | 0.0000 | 0.0004 |
| beta5 | 1000 | 0.0001 | 0.0001 | 0.0091 |
| beta6 | 1000 | 0.0000 | 0.0000 | 0.0003 |
| deviance | 1000 | 0.0122 | 0.0122 | 0.1178 |
In the following they are reported some Convergence Diagnostics, aimed at verifying the presence of converge issues, which might suggest to enlarge the number of simulations or use some other types of parametrizations.
Gelman-Rubin diagnostic evaluates the convergence by analyzing the difference between multiple Markov chains. Let:
For each parameter, we compare the between-chains and within-chain estimated variances; large differences between them indicate nonconvergence.
\[ W_T = \frac{1}{M} \sum_{m=1}^3 \left[ \frac{1}{T} \sum_{t=1}^T \left(h\left(\beta_t^{(m)}\right) -\hat{I}_T^{(m)}\right ) \right] \]
where:
Based on these measures, we compute the Potential Scale Reduction factor: \[ \hat{R_T} = \sqrt{ \frac{T-1}{T} W_T + \frac{B_T}{n}} \]
The model is healthy if \(\hat{R}_T \approx 1\) and is below some threshold (usually \(\hat{R}_T < 1.1\)).
| Point Estimate | Upper CI | |
|---|---|---|
| beta0 | 1.000 | 1.001 |
| beta1 | 1.001 | 1.005 |
| beta2 | 1.000 | 1.002 |
| beta4 | 1.001 | 1.006 |
| beta5 | 1.001 | 1.004 |
| beta6 | 1.001 | 1.004 |
| deviance | 1.003 | 1.009 |
| Multivariate PSRF |
|---|
| 1.005 |
Geweke Diagnostics consists in an inter chain convergence check. Compares the first 20% of the chain after burnin with the half last 50% percent; If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke’s statistic has an asymptotically standard normal distribution. The test statistic is a standard Z-score, calculated under the assumption that the two parts of the chain are asymptotically independent: \[ Z = \frac{\hat{\beta}_A-\hat{\beta}_B}{\frac{1}{T_A}\hat{S}^{A}_{\beta}(0)+\frac{1}{T_B}\hat{S}^{B}_{\beta}(0)} \] where \(A\),\(B\) are two windows within the Markov chain and the standard error is estimated from the spectral density at zero and so takes into account any autocorrelation. If \(|Z|<1.96\) we accept the null hypothesis.
| Chain 1 | Chain 2 | Chain 3 | |
|---|---|---|---|
| beta0 | 0.509 | 0.378 | -0.066 |
| beta1 | 0.154 | -0.097 | -1.011 |
| beta2 | -1.221 | -0.114 | -0.821 |
| beta4 | 0.163 | -0.651 | 0.417 |
| beta5 | 0.020 | 0.396 | 0.742 |
| beta6 | 0.642 | -0.546 | 0.846 |
| deviance | -0.019 | -0.547 | -0.022 |
Uses a test statistic to verify the null hypothesis that the values sampled from the Markov Chian come from a stationary distribution. It composes of two parts:
Stationary Test:
Halfwidth Test: If the chain passes the first part of the diagnostic, then the part of the chain that was not discarded from the first part is used to test the second part.
| Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth | |
|---|---|---|---|---|---|---|
| beta0 | pass | 1 | 0.711 | pass | -8.864 | 0.043 |
| beta1 | pass | 1 | 0.773 | pass | 0.117 | 0.002 |
| beta2 | pass | 1 | 0.462 | pass | 0.035 | 0.000 |
| beta4 | pass | 1 | 0.828 | pass | 0.100 | 0.001 |
| beta5 | pass | 1 | 0.434 | pass | 0.914 | 0.018 |
| beta6 | pass | 1 | 0.288 | pass | 0.011 | 0.001 |
| deviance | pass | 1 | 0.160 | pass | 848.991 | 0.217 |
| Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth | |
|---|---|---|---|---|---|---|
| beta0 | pass | 1 | 0.986 | pass | -8.880 | 0.040 |
| beta1 | pass | 1 | 0.880 | pass | 0.115 | 0.002 |
| beta2 | pass | 1 | 0.809 | pass | 0.035 | 0.000 |
| beta4 | pass | 1 | 0.806 | pass | 0.100 | 0.001 |
| beta5 | pass | 1 | 0.386 | pass | 0.889 | 0.014 |
| beta6 | pass | 1 | 0.786 | pass | 0.011 | 0.001 |
| deviance | pass | 1 | 0.689 | pass | 848.647 | 0.217 |
| Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth | |
|---|---|---|---|---|---|---|
| beta0 | pass | 1 | 0.947 | pass | -8.848 | 0.044 |
| beta1 | pass | 1 | 0.230 | pass | 0.116 | 0.002 |
| beta2 | pass | 1 | 0.302 | pass | 0.035 | 0.000 |
| beta4 | pass | 1 | 0.718 | pass | 0.099 | 0.001 |
| beta5 | pass | 1 | 0.632 | pass | 0.915 | 0.018 |
| beta6 | pass | 1 | 0.506 | pass | 0.011 | 0.001 |
| deviance | pass | 1 | 0.965 | pass | 849.030 | 0.228 |
In order to validate our model, we can create, through simulation, a synthetic dataset from a distribution whose parameters (\(\beta\)’s) are fixed and thus, known. Then, we check if it is able to recover the true population parameters by applying it to the simulated data.
| simulation parameters | fixed parameters | ratio | |
|---|---|---|---|
| beta0 | -8.787 | -8.798 | 1.001 |
| beta1 | 0.083 | 0.114 | 1.375 |
| beta2 | 0.035 | 0.035 | 0.990 |
| beta4 | 0.109 | 0.099 | 0.909 |
| beta5 | 0.873 | 0.896 | 1.026 |
| beta6 | 0.006 | 0.011 | 1.783 |
As we can notice, the approximation ratio is \(\approx 1\) for almost all the features. We can conclude that the model is able to recover the true parameter and confirm its adequacy.
We want to test our model by evaluating its predictive performances over new observations (test data);
Here they are reported the results of our Bayesian Logistic Regression Classifier,with the canonical cut-off level = 0.5, compared to the ones obtained by using different Machine Learning models, namely:
| Bayesian_LR | Frequentistic_LR | RandomForest | SVM | |
|---|---|---|---|---|
| Accuracy | 0.7884615 | 0.7740385 | 0.8317308 | 0.7788462 |
| Precision | 0.8155340 | 0.9029126 | 0.8476190 | 0.7961165 |
| Recall | 0.7706422 | 0.7153846 | 0.8240741 | 0.7663551 |
| F1-score | 0.7924528 | 0.7982833 | 0.8356808 | 0.7809524 |
As we can notice, Random Forest is the best performing model; nevertheless, our Bayesian Logistic Regression has an accuracy higher than both Frequentistic Logistic Regression and Support Vector Classifier, despite a lower Precision than the former. An important remark is that we are mainly interested in achieving an high recall, since we want to detect as many Diabetic people as possible to allow them to preventive care; our model overtakes both Frequentistic LR and SVM.
ROC curve for different values of cutoffs \(p\) for classifying observations:
The area under the ROC curve is: 0.882
Mohammed, Ahmed & Hassan, Masoud & Kadir, Dler. (2020). Improving Classification Performance for a Novel Imbalanced Medical Dataset using SMOTE Method. International Journal of Advanced Trends in Computer Science and Engineering. 9. 3161-3172. 10.30534/ijatcse/2020/104932020.
Hassan, Masoud. (2020). A Fully Bayesian Logistic Regression Model for Classification of ZADA Diabetes Dataset. Science Journal of University of Zakho. 8. 105-111. 10.25271/sjuoz.2020.8.3.707.